An Information-Theoretic Sentence Similarity Metric

نویسندگان

John G. Mersch

R. Raymond Lang

چکیده

We describe an information theoretic-based metric for sentence similarity. The method uses the information content (IC) of dependency triples using corpus statistics generated by processing the Open American National Corpus (OANC) with the Stanford Parser. We define the similarity of two sentences as a function of (1) the similarity of their constituent dependency triples, and (2) the position of the triples in their respective dependency trees. We compare results of the algorithm to human judgments of similarity of 1725 sentence pairs.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Composite Kernel Optimization in Semi-Supervised Metric

Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...

متن کامل

MAXSIM: An Automatic Metric for Machine Translation Evaluation Based on Maximum Similarity

This paper describes our participation in the NIST 2008 MetricsMATR Challenge, using our recently proposed automatic machine translation evaluation metric MAXSIM. The metric calculates a similarity score between a pair of English system-reference sentences by comparing information items such as ngrams across the sentence pair. Unlike most metrics, MAXSIM computes a similarity score between item...

متن کامل

An Effective Approach for Robust Metric Learning in the Presence of Label Noise

Many algorithms in machine learning, pattern recognition, and data mining are based on a similarity/distance measure. For example, the kNN classifier and clustering algorithms such as k-means require a similarity/distance function. Also, in Content-Based Information Retrieval (CBIR) systems, we need to rank the retrieved objects based on the similarity to the query. As generic measures such as ...

متن کامل

BIOSSES: a semantic sentence similarity estimation system for the biomedical domain

Motivation The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summariza...

متن کامل

MAXSIM: A Maximum Similarity Metric for Machine Translation Evaluation

We propose an automatic machine translation (MT) evaluation metric that calculates a similarity score (based on precision and recall) of a pair of sentences. Unlike most metrics, we compute a similarity score between items across the two sentences. We then find a maximum weight matching between the items such that each item in one sentence is mapped to at most one item in the other sentence. Th...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2015

An Information-Theoretic Sentence Similarity Metric

نویسندگان

چکیده

منابع مشابه

Composite Kernel Optimization in Semi-Supervised Metric

MAXSIM: An Automatic Metric for Machine Translation Evaluation Based on Maximum Similarity

An Effective Approach for Robust Metric Learning in the Presence of Label Noise

BIOSSES: a semantic sentence similarity estimation system for the biomedical domain

MAXSIM: A Maximum Similarity Metric for Machine Translation Evaluation

عنوان ژورنال:

اشتراک گذاری